Machine Learning
MSE FTP MachLe
Christoph Würsch

ML12 A5 Detecting Similar Faces using DBSCAN¶

The Labeled Faces in the Wild dataset shipped with scikit-learn contains grayscale images of 62 different famous personalities, many of them from politics. In this exercise, we assume that there are no target labels, i.e. the names of the persons are unknown. We want to find a method to cluster similar images. This can be done with a dimensionality-reduction algorithm such as PCA for feature generation, followed by a clustering step, e.g. with DBSCAN.
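As a hedged warm-up, the same two-stage pipeline (PCA for feature reduction, then DBSCAN for clustering) can be sketched on synthetic blobs instead of face images, so the snippet runs offline; the eps value is tuned to this toy data, not to faces.

```python
# Two-stage pipeline sketch: PCA, then DBSCAN, on synthetic stand-in data.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

X, _ = make_blobs(n_samples=300, n_features=20, centers=3,
                  cluster_std=0.5, random_state=0)

# project the 20 features onto the first 2 principal components
X_red = PCA(n_components=2, random_state=0).fit_transform(X)

# density-based clustering in the reduced space; label -1 marks noise
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X_red)
print("clusters found:", len(set(labels) - {-1}))
```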

In [1]:
%matplotlib inline
from IPython.display import set_matplotlib_formats, display
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from cycler import cycler
plt.rcParams['image.cmap'] = "gray"
In [2]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

(a) Loading the Faces Dataset¶

Open the Jupyter notebook DBSCAN_DetectSimilarFaces.ipynb and have a look at the first few faces of the dataset. Not every person is represented equally frequently in this unbalanced dataset. For classification, we would have to take this into account. We extract at most the first 50 images of each person and put them into a flat array called X_people. The corresponding targets (y-values, names) are stored in the y_people array.

In [3]:
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_lfw_people
people = fetch_lfw_people(min_faces_per_person=20, resize=2)
image_shape = people.images[0].shape

fig, axes = plt.subplots(2, 5, figsize=(15, 8),
                         subplot_kw={'xticks': (), 'yticks': ()})
for target, image, ax in zip(people.target, people.images, axes.ravel()):
    ax.imshow(image)
    ax.set_title(people.target_names[target])
In [4]:
np.shape(people.images)
Out[4]:
(3023, 250, 188)
In [5]:
left = 5
top = 5
right = image_shape[1]-left
bottom = image_shape[0]-top

import PIL

for img in people.images[0:3,:,:]:
    #pil_img = PIL.Image.fromarray(np.uint8(img*255))
    pil_img = PIL.Image.fromarray(img)
    plt.figure()
    plt.imshow(np.array(pil_img))
    pil_img = pil_img.crop((left, top, right, bottom))
    plt.figure()
    plt.imshow(np.array(pil_img))

np.shape(np.array(pil_img))
Out[5]:
(240, 178)

https://github.com/scikit-learn/scikit-learn/issues/24942

In [6]:
print("people.images.shape: {}".format(people.images.shape))
print("Number of classes: {}".format(len(people.target_names)))
people.images.shape: (3023, 250, 188)
Number of classes: 62
In [7]:
people.target_names
Out[7]:
array(['Alejandro Toledo', 'Alvaro Uribe', 'Amelie Mauresmo',
       'Andre Agassi', 'Angelina Jolie', 'Ariel Sharon',
       'Arnold Schwarzenegger', 'Atal Bihari Vajpayee', 'Bill Clinton',
       'Carlos Menem', 'Colin Powell', 'David Beckham', 'Donald Rumsfeld',
       'George Robertson', 'George W Bush', 'Gerhard Schroeder',
       'Gloria Macapagal Arroyo', 'Gray Davis', 'Guillermo Coria',
       'Hamid Karzai', 'Hans Blix', 'Hugo Chavez', 'Igor Ivanov',
       'Jack Straw', 'Jacques Chirac', 'Jean Chretien',
       'Jennifer Aniston', 'Jennifer Capriati', 'Jennifer Lopez',
       'Jeremy Greenstock', 'Jiang Zemin', 'John Ashcroft',
       'John Negroponte', 'Jose Maria Aznar', 'Juan Carlos Ferrero',
       'Junichiro Koizumi', 'Kofi Annan', 'Laura Bush',
       'Lindsay Davenport', 'Lleyton Hewitt', 'Luiz Inacio Lula da Silva',
       'Mahmoud Abbas', 'Megawati Sukarnoputri', 'Michael Bloomberg',
       'Naomi Watts', 'Nestor Kirchner', 'Paul Bremer', 'Pete Sampras',
       'Recep Tayyip Erdogan', 'Ricardo Lagos', 'Roh Moo-hyun',
       'Rudolph Giuliani', 'Saddam Hussein', 'Serena Williams',
       'Silvio Berlusconi', 'Tiger Woods', 'Tom Daschle', 'Tom Ridge',
       'Tony Blair', 'Vicente Fox', 'Vladimir Putin', 'Winona Ryder'],
      dtype='<U25')
In [8]:
# count how often each target appears
counts = np.bincount(people.target)
# print counts next to target names:
for i, (count, name) in enumerate(zip(counts, people.target_names)):
    print("{0:25} {1:3}".format(name, count), end='   ')
    if (i + 1) % 3 == 0:
        print()
Alejandro Toledo           39   Alvaro Uribe               35   Amelie Mauresmo            21   
Andre Agassi               36   Angelina Jolie             20   Ariel Sharon               77   
Arnold Schwarzenegger      42   Atal Bihari Vajpayee       24   Bill Clinton               29   
Carlos Menem               21   Colin Powell              236   David Beckham              31   
Donald Rumsfeld           121   George Robertson           22   George W Bush             530   
Gerhard Schroeder         109   Gloria Macapagal Arroyo    44   Gray Davis                 26   
Guillermo Coria            30   Hamid Karzai               22   Hans Blix                  39   
Hugo Chavez                71   Igor Ivanov                20   Jack Straw                 28   
Jacques Chirac             52   Jean Chretien              55   Jennifer Aniston           21   
Jennifer Capriati          42   Jennifer Lopez             21   Jeremy Greenstock          24   
Jiang Zemin                20   John Ashcroft              53   John Negroponte            31   
Jose Maria Aznar           23   Juan Carlos Ferrero        28   Junichiro Koizumi          60   
Kofi Annan                 32   Laura Bush                 41   Lindsay Davenport          22   
Lleyton Hewitt             41   Luiz Inacio Lula da Silva  48   Mahmoud Abbas              29   
Megawati Sukarnoputri      33   Michael Bloomberg          20   Naomi Watts                22   
Nestor Kirchner            37   Paul Bremer                20   Pete Sampras               22   
Recep Tayyip Erdogan       30   Ricardo Lagos              27   Roh Moo-hyun               32   
Rudolph Giuliani           26   Saddam Hussein             23   Serena Williams            52   
Silvio Berlusconi          33   Tiger Woods                23   Tom Daschle                25   
Tom Ridge                  33   Tony Blair                144   Vicente Fox                32   
Vladimir Putin             49   Winona Ryder               24   
In [10]:
mask = np.zeros(people.target.shape, dtype=bool)
for target in np.unique(people.target):
    mask[np.where(people.target == target)[0][:50]] = 1

X_people = people.data[mask]
y_people = people.target[mask]

# scale the grey-scale values to be between 0 and 1
# instead of 0 and 255 for better numeric stability:
X_people = X_people / 255.
In [11]:
NumberOfPeople = np.unique(people.target).shape[0]
TargetNames = []
n = 5

# show one image per person (62 people -> 13 x 5 grid)
fig, axes = plt.subplots(13, 5, figsize=(15, 33),
                         subplot_kw={'xticks': (), 'yticks': ()})

for target, ax in zip(np.unique(people.target), axes.ravel()):
    # take the first n pictures of each person and display the first of them
    indices = np.where(people.target == target)[0][:n]
    TargetNames.append(people.target_names[target])

    image = people.images[indices[0]]
    ax.imshow(image)
    ax.set_title(str(target) + ': ' + TargetNames[target])

(b) Principal Component Analysis¶

Now apply a principal component analysis, X_pca = pca.fit_transform(X_people), and extract the first 100 components of each image. Reconstruct the first 10 entries of the dataset from their 100 PCA components by applying the pca.inverse_transform method and reshaping the result to the original image size using np.reshape.

What is the minimum number of components necessary such that you recognize the persons? Try it out.
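One way to explore this question quantitatively is to sweep n_components and compare reconstruction errors. The sketch below uses scikit-learn's small digits dataset as a stand-in for the faces, so it runs without downloading LFW; replacing X with X_people would give the corresponding numbers for the face data.

```python
# Reconstruction error as a function of the number of PCA components,
# illustrated on the digits dataset as a lightweight stand-in.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data / 16.0  # 8x8 digit images, scaled to [0, 1] like X_people

mses = []
for n in [5, 20, 50]:
    pca = PCA(n_components=n, whiten=True, random_state=0)
    X_rec = pca.inverse_transform(pca.fit_transform(X))
    mses.append(np.mean((X - X_rec) ** 2))
    print("n_components={:3d}  reconstruction MSE={:.4f}".format(n, mses[-1]))
```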

In [12]:
NumberOfPeople
Out[12]:
62
In [13]:
#extract eigenfaces from lfw data and transform data
from sklearn.decomposition import PCA
pca = PCA(n_components=100, whiten=True, random_state=0)
X_pca = pca.fit_transform(X_people)
#X_pca = pca.transform(X_people)

image_shape = people.images[0].shape
NumberOfSamples=X_pca.shape[0]

fig, axes = plt.subplots(2, 5, figsize=(15, 8),
                         subplot_kw={'xticks': (), 'yticks': ()})

for ix, target, ax in zip(np.arange(NumberOfSamples), y_people, axes.ravel()):
    image=np.reshape(pca.inverse_transform(X_pca[ix,:]),image_shape)
    ax.imshow(image)
    ax.set_title(str(y_people[ix])+': '+people.target_names[target])

(c) Apply DBSCAN on these features¶

Import the DBSCAN class from sklearn.cluster, create an instance called dbscan, apply it to the PCA-transformed data X_pca, and extract the cluster labels using labels = dbscan.fit_predict(X_pca). First use the default parameters of the method and check how many distinct clusters the algorithm finds by counting the unique entries in the predicted cluster labels.

In [14]:
# apply DBSCAN with default parameters
from sklearn.cluster import DBSCAN
dbscan = DBSCAN()
labels = dbscan.fit_predict(X_pca)
print("Unique labels: {}".format(np.unique(labels)))
Unique labels: [-1]
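With the default eps=0.5, no point has enough close neighbours, so every face is labelled -1 (noise). A common heuristic for choosing eps is to inspect the distribution of k-nearest-neighbour distances; the sketch below illustrates the idea on synthetic 100-dimensional data standing in for X_pca, so the numbers are illustrative only.

```python
# k-distance heuristic for choosing DBSCAN's eps; synthetic data
# stands in for the whitened 100-dimensional PCA features X_pca.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))

# distance to the 3rd-nearest neighbour (matching min_samples=3);
# n_neighbors=4 because each query point is its own nearest neighbour
nn = NearestNeighbors(n_neighbors=4).fit(X)
dist, _ = nn.kneighbors(X)
k_dist = np.sort(dist[:, -1])
print("median 3-NN distance: {:.2f}".format(np.median(k_dist)))
# eps should sit near this scale; the default 0.5 lies far below it,
# which is why DBSCAN() declared every point a noise point
```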

(d) Variation of the eps parameter¶

Change the parameters of the estimator using DBSCAN(min_samples=3, eps=5). Then vary eps in the range from 6 to 8 in small steps and check, for each value of eps, how many clusters are found. Save the labels from the clustering that yields the largest number of clusters.

In [20]:
max_clust = 0
labels_max = [] # for storing the labels from the clustering with the largest number of clusters
for eps in np.linspace(6,8,51):
    print("\neps={}".format(eps))
    dbscan = DBSCAN(eps=eps, min_samples=3)
    labels = dbscan.fit_predict(X_pca)
    if max_clust < len(np.unique(labels)):
      max_clust = len(np.unique(labels))
      labels_max = labels
    print("Number of clusters: {}".format(len(np.unique(labels))))
    print("Cluster sizes: {}".format(np.bincount(labels + 1)))

print(max_clust)
eps=6.0
Number of clusters: 4
Cluster sizes: [2050    3    7    3]

eps=6.04
Number of clusters: 4
Cluster sizes: [2048    3    7    5]

eps=6.08
Number of clusters: 4
Cluster sizes: [2048    3    7    5]

eps=6.12
Number of clusters: 4
Cluster sizes: [2048    3    7    5]

eps=6.16
Number of clusters: 5
Cluster sizes: [2045    3    7    3    5]

eps=6.2
Number of clusters: 6
Cluster sizes: [2042    3    7    3    3    5]

eps=6.24
Number of clusters: 6
Cluster sizes: [2042    3    7    3    3    5]

eps=6.28
Number of clusters: 6
Cluster sizes: [2042    3    7    3    3    5]

eps=6.32
Number of clusters: 6
Cluster sizes: [2042    3    7    3    3    5]

eps=6.36
Number of clusters: 6
Cluster sizes: [2042    3    7    3    3    5]

eps=6.4
Number of clusters: 6
Cluster sizes: [2041    3    7    3    6    3]

eps=6.44
Number of clusters: 7
Cluster sizes: [2036    3    4    7    6    3    4]

eps=6.48
Number of clusters: 7
Cluster sizes: [2032    3    5    7    7    6    3]

eps=6.52
Number of clusters: 7
Cluster sizes: [2030    3    7    7    7    6    3]

eps=6.5600000000000005
Number of clusters: 7
Cluster sizes: [2028    3    8    7    8    6    3]

eps=6.6
Number of clusters: 7
Cluster sizes: [2026    3    6    8   10    7    3]

eps=6.64
Number of clusters: 8
Cluster sizes: [2018    4   13    8   10    4    3    3]

eps=6.68
Number of clusters: 7
Cluster sizes: [2010    4   13   25    5    3    3]

eps=6.72
Number of clusters: 7
Cluster sizes: [2009    4   13   26    5    3    3]

eps=6.76
Number of clusters: 6
Cluster sizes: [2005   36   13    3    3    3]

eps=6.8
Number of clusters: 7
Cluster sizes: [1999   38   14    3    3    3    3]

eps=6.84
Number of clusters: 6
Cluster sizes: [1991   49   14    3    3    3]

eps=6.88
Number of clusters: 8
Cluster sizes: [1985   49   14    3    3    3    3    3]

eps=6.92
Number of clusters: 7
Cluster sizes: [1975   61   14    3    4    3    3]

eps=6.96
Number of clusters: 8
Cluster sizes: [1962   68    4   14    6    3    3    3]

eps=7.0
Number of clusters: 7
Cluster sizes: [1954   77    7    4   14    4    3]

eps=7.04
Number of clusters: 9
Cluster sizes: [1943   82    7    4   14    3    4    3    3]

eps=7.08
Number of clusters: 10
Cluster sizes: [1929   93    7    4   14    3    4    3    3    3]

eps=7.12
Number of clusters: 7
Cluster sizes: [1924  118    7    3    4    4    3]

eps=7.16
Number of clusters: 6
Cluster sizes: [1916  134    3    3    4    3]

eps=7.2
Number of clusters: 4
Cluster sizes: [1909  147    3    4]

eps=7.24
Number of clusters: 5
Cluster sizes: [1893  158    3    5    4]

eps=7.28
Number of clusters: 4
Cluster sizes: [1887  165    7    4]

eps=7.32
Number of clusters: 4
Cluster sizes: [1877  174    7    5]

eps=7.36
Number of clusters: 5
Cluster sizes: [1863  185    3    7    5]

eps=7.4
Number of clusters: 4
Cluster sizes: [1857  196    3    7]

eps=7.4399999999999995
Number of clusters: 6
Cluster sizes: [1842  205    3    3    7    3]

eps=7.48
Number of clusters: 6
Cluster sizes: [1828  219    3    3    7    3]

eps=7.52
Number of clusters: 4
Cluster sizes: [1817  240    3    3]

eps=7.5600000000000005
Number of clusters: 5
Cluster sizes: [1799  254    3    3    4]

eps=7.6
Number of clusters: 5
Cluster sizes: [1788  265    3    3    4]

eps=7.640000000000001
Number of clusters: 4
Cluster sizes: [1774  283    3    3]

eps=7.68
Number of clusters: 4
Cluster sizes: [1755  302    3    3]

eps=7.72
Number of clusters: 5
Cluster sizes: [1746  308    3    3    3]

eps=7.76
Number of clusters: 4
Cluster sizes: [1739  318    3    3]

eps=7.8
Number of clusters: 5
Cluster sizes: [1722  330    4    3    4]

eps=7.84
Number of clusters: 5
Cluster sizes: [1701  351    4    3    4]

eps=7.88
Number of clusters: 5
Cluster sizes: [1691  361    4    3    4]

eps=7.92
Number of clusters: 4
Cluster sizes: [1684  372    3    4]

eps=7.96
Number of clusters: 4
Cluster sizes: [1672  384    4    3]

eps=8.0
Number of clusters: 3
Cluster sizes: [1655  405    3]
10

(e) Maximum number of clusters found¶

Plot the members of all clusters with at most 10 samples from the clustering with the largest number of clusters, using the following Python code.

In [26]:
# the labels from the clustering with the largest number of clusters
labels = labels_max

for cluster in range(max(labels) + 1):
    mask = labels == cluster
    n_images =  np.sum(mask)
    print("Cluster number: {}".format(cluster))
    print("Cluster size: {}".format(n_images))
    if n_images<11:
        fig, axes = plt.subplots(1, n_images, figsize=(n_images * 1.5, 4),
                             subplot_kw={'xticks': (), 'yticks': ()})
        for image, label, ax in zip(X_people[mask], y_people[mask], axes):
            ax.imshow(image.reshape(image_shape))
            ax.set_title(people.target_names[label].split()[-1])
Cluster number: 0
Cluster size: 93
Cluster number: 1
Cluster size: 7
Cluster number: 2
Cluster size: 4
Cluster number: 3
Cluster size: 14
Cluster number: 4
Cluster size: 3
Cluster number: 5
Cluster size: 4
Cluster number: 6
Cluster size: 3
Cluster number: 7
Cluster size: 3
Cluster number: 8
Cluster size: 3
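A hedged aside: although the exercise treats the names as unknown, y_people is in fact available, so any clustering could be scored against it, for example with the adjusted Rand index. The toy arrays below are illustrative stand-ins, not the labels from this notebook.

```python
# Scoring a clustering against known labels with the adjusted Rand index.
from sklearn.metrics import adjusted_rand_score

true_labels  = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred_perfect = [1, 1, 1, 0, 0, 0, 2, 2, 2]  # same partition, clusters renamed
pred_random  = [0, 1, 2, 0, 1, 2, 0, 1, 2]  # one member of each class per cluster

print("ARI (perfect):", adjusted_rand_score(true_labels, pred_perfect))  # 1.0
print("ARI (random): ", adjusted_rand_score(true_labels, pred_random))   # negative: worse than chance
```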

Bonus: Agglomerative and Spectral Clustering (optional)¶

In [30]:
# %% using other cluster algorithms learner on the pca transformed data
from time import time
from sklearn import cluster
from sklearn.neighbors import kneighbors_graph

n_clusters=14

clustering_names = ['SpectralClustering', 'Ward', 'AverageLinkage']

connectivity = kneighbors_graph(X_pca, n_neighbors=n_clusters, include_self=False)
# make connectivity symmetric
connectivity = 0.5 * (connectivity + connectivity.T)

spectral = cluster.SpectralClustering(n_clusters=n_clusters,
                                          eigen_solver='arpack',
                                          affinity="nearest_neighbors")

ward = cluster.AgglomerativeClustering(n_clusters=n_clusters, linkage='ward',
                                           connectivity=connectivity)

# note: `affinity` was deprecated in scikit-learn 1.2 in favour of `metric`
# (hence the FutureWarning in the output below)
average_linkage = cluster.AgglomerativeClustering(
        linkage="average", affinity="cityblock", n_clusters=n_clusters,
        connectivity=connectivity)


clustering_algorithms = [spectral, ward, average_linkage]

# %matplotlib inline
for name, algorithm in zip(clustering_names, clustering_algorithms):
    # predict cluster memberships
    print(algorithm)
    t0 = time()
    algorithm.fit(X_pca)
    t1 = time()

    if hasattr(algorithm, 'labels_'):
        labels = algorithm.labels_.astype(int)
    else:
        labels = algorithm.predict(X_pca)

    print("%s: %.2g sec" % (name,t1 - t0))
    print('labels found: %i' % (max(labels) + 1))
    print("_____________________________________________")
    print("       %s                                     " % (name))
    print("_____________________________________________")

    # loop variable named `cl` to avoid shadowing the imported `cluster` module
    for cl in range(max(labels) + 1):
        mask = labels == cl
        ind = np.where(mask)[0]
        n_images = np.size(ind)
        submask = np.zeros(X_pca.shape[0], dtype=bool)
        submask[ind] = True

        max_image=np.min([n_images,8])
        print('max image: %i\n' % (max_image))
        fig, axes = plt.subplots(1, max_image, figsize=(max_image * 3, 3),
                         subplot_kw={'xticks': (), 'yticks': ()})

        if max_image==1:
            print(ind[0])
            image=X_people[ind[0]]
            label=y_people[ind[0]]
            plt.imshow(image.reshape(image_shape))
            plt.title(people.target_names[label].split()[-1])
        else:
            for image, label, ax in zip(X_people[submask], y_people[submask], axes):
                ax.imshow(image.reshape(image_shape))
                ax.set_title(people.target_names[label].split()[-1])

        plt.show()
SpectralClustering(affinity='nearest_neighbors', eigen_solver='arpack',
                   n_clusters=14)
SpectralClustering: 3 sec
labels found: 14
_____________________________________________
       SpectralClustering                                     
_____________________________________________
max image: 8

max image: 8

max image: 8

max image: 8

max image: 8

max image: 8

max image: 8

max image: 7

max image: 8

max image: 8

max image: 8

max image: 8

max image: 8

max image: 8

AgglomerativeClustering(connectivity=<2063x2063 sparse matrix of type '<class 'numpy.float64'>'
	with 54374 stored elements in Compressed Sparse Row format>,
                        n_clusters=14)
Ward: 1.1 sec
labels found: 14
_____________________________________________
       Ward                                     
_____________________________________________
max image: 8

max image: 8

max image: 8

max image: 8

max image: 8

max image: 8

max image: 8

max image: 8

max image: 8

max image: 8

max image: 8

max image: 8

max image: 8

max image: 8

AgglomerativeClustering(affinity='cityblock',
                        connectivity=<2063x2063 sparse matrix of type '<class 'numpy.float64'>'
	with 54374 stored elements in Compressed Sparse Row format>,
                        linkage='average', n_clusters=14)
/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_agglomerative.py:983: FutureWarning: Attribute `affinity` was deprecated in version 1.2 and will be removed in 1.4. Use `metric` instead
  warnings.warn(
AverageLinkage: 6.4 sec
labels found: 14
_____________________________________________
       AverageLinkage                                     
_____________________________________________
max image: 8

max image: 1

1989
max image: 2

max image: 1

1606
max image: 1

1496
max image: 1

1713
max image: 1

1881
max image: 1

627
max image: 1

1219
max image: 1

661
max image: 1

63
max image: 1

1543
max image: 1

1507
max image: 1

1090
In [ ]: